Overlapping of genes in the human genome.

Overlapping genes are relatively common in DNA and RNA viruses. Several examples are known in bacterial and eukaryotic genomes, but overlapping genes are generally quite rare in organisms other than viruses, and only a few have been reported in mammalian genomes. The present study identified all of the overlapping loci and overlapping exons in every chromosome of the human genome using a public database. The total number of overlapping loci on the same and opposite strands was 949 and 743, respectively. On every chromosome except chromosome 5, the number of instances in which 2 loci overlapped on the same strand was similar to the number in which 2 loci overlapped on opposite strands. The number of instances in which 2 exons overlapped on the same strand was higher than the number for 2 exons on opposite strands, indicating the presence of many comprehensive-type overlaps. The mean percentage of overlapping exons on opposite strands in each chromosome was 3.3%, suggesting that parts of the nucleotide sequences of 26,501 exons are used to produce 2 transcribed products, one from each strand. The ratio of the number of overlapping regions to chromosomal length was high on chromosomes 22, 17 and 19 for both types of overlap, that is, for loci and exons located on the same and on opposite strands. Ratios were low on chromosomes Y, 13 and 18. These results show that all overlapping types are distributed throughout the human genome, but that distributions differ for each chromosome.


INTRODUCTION
Publication of the human genome sequence marked a significant milestone in the field of biology. Significant biological information can also be gained from the analysis of genome organization. Approximately 30,000 protein-coding genes are thought to be present in the human genome, and the positions of genes within chromosomes are currently being established (1)(2)(3)(4).
DNA sequences can code for more than one gene product by using different reading frames or different initiation codons. Overlapping genes are relatively common in DNA and RNA viruses (5)(6)(7)(8)(9). While several examples exist in bacterial and eukaryotic genomes, overlapping genes appear to be relatively rare in non-viral organisms and few reports have described overlapping genes in mammalian genomes (10)(11)(12). Some studies have demonstrated that the overlapping of genes differs among species and have inferred that this can be attributed to differences in evolutionary histories (13)(14)(15).
Given that both strands of the human genome are used for transcription, two types of overlapping are possible: 2 genes overlapping on the same strand, and 2 genes overlapping on opposite strands. Furthermore, overlapping patterns can be classified by the relative positions of the 2 genes. Little exact information is available regarding overlapping genes in the human genome and their associated overlapping patterns. To investigate this phenomenon and the biological information it contains, the number of overlapping genes and their patterns were examined in every chromosome of the human genome.

METHODS
The positions and sequences of each gene were obtained from the National Center for Biotechnology Information (NCBI) database (build 31; published January 15, 2003; http://www.ncbi.nlm.nih.gov/genome/guide/human/HsStats.html). Each locus was defined using both LocusLink and RefSeq, using gene symbols and names established by the nomenclature committee for the genome (http://www.gene.ucl.ac.uk/nomenclature/).
In the LocusLink report, symbols and names were reported under the banner (http://www.ncbi.nlm.nih.gov/). Exons were defined as DNA sequences coding mRNA, without regard to their functions or locations within specific genes. This definition allowed for analysis of all possible cases of overlapping. Official gene symbols and names were used as follows: for RefSeq records, symbols were assigned using the LOCUS system. If a symbol had not yet officially been assigned, an interim symbol and name were arbitrarily selected. Arbitrarily selected symbols and names are included at this website (http://www.ncbi.nlm.nih.gov/LocusLink/collaborators.html).
All loci and exons registered in NCBI build 31 were examined, using the data describing the position of genes on the chromosome. The nucleotide sequences and the positions of each nucleotide in the whole human genome were downloaded and stored in Excel file format (Microsoft Corporation, Redmond, Washington). All overlapping loci and overlapping exons were defined according to the start and end positions of each locus and exon. Data for overlapping loci were produced from the data for registered loci, while data for overlapping exons were produced from the data for distinct coding regions and mRNA sequences.
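Defining overlaps from start and end positions, as described above, can be sketched as a simple interval comparison. The following Python sketch is illustrative only and is not the procedure used in the study (which worked from spreadsheet data). The `Region` record, its field names, the 1-based inclusive coordinate convention, and the toy coordinates are all assumptions introduced for this example.

```python
from collections import namedtuple

# Hypothetical record layout (an assumption, not the NCBI schema).
# Coordinates are assumed 1-based and inclusive; strand is "+" or "-".
Region = namedtuple("Region", "name chrom strand start end")


def overlap_length(a, b):
    """Return the number of bases shared by two regions (0 if none)."""
    if a.chrom != b.chrom:
        return 0
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def find_overlaps(regions):
    """Yield (a, b, shared_bases, same_strand) for every overlapping pair."""
    by_chrom = {}
    for r in regions:
        by_chrom.setdefault(r.chrom, []).append(r)
    for chrom_regions in by_chrom.values():
        chrom_regions.sort(key=lambda r: (r.start, r.end))
        for i, a in enumerate(chrom_regions):
            for b in chrom_regions[i + 1:]:
                if b.start > a.end:
                    break  # sorted by start, so no later region overlaps a
                yield a, b, overlap_length(a, b), a.strand == b.strand


# Made-up coordinates, for illustration only.
toy = [
    Region("locusA", "1", "+", 100, 500),
    Region("locusB", "1", "-", 400, 900),
    Region("locusC", "1", "+", 1000, 1200),
]
for a, b, shared, same in find_overlaps(toy):
    print(a.name, b.name, shared, same)  # locusA locusB 101 False
```

Sorting by start position lets the inner loop stop as soon as a region begins beyond the end of the current one, which avoids comparing every pair on a chromosome.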

RESULTS
The type of overlap was classified into 8 groups (Fig. 1), with 2 loci or exons on the same strand divided into 4 groups, and likewise for those on opposite strands. We classified the patterns of overlapping genes by considering the strand location of the respective genes. For example, for Groups 1 and 2, both regions are on the positive (sense) strand and their mRNAs are transcribed using the negative (antisense) strand as the template. Groups 1, 2 and 7 were also distinguished from Groups 3, 4 and 8, respectively.
Fig. 1. Black arrows indicate locus or exon "1". Shaded arrows indicate locus or exon "2". Locus "2" or exon "2" was defined as the one whose short-arm (p) side was located downstream of that of locus "1" or exon "1". A was the length of the flanking region without overlapping, located on the side of the short arm (p). B was the length of the overlapping region. C was the length of the flanking region without overlapping, located on the side of the long arm (q). a: Schema of 2 loci or exons on the same strand. Groups 1 and 2: both regions are on the positive strand (unidirectional). Groups 3 and 4: both regions are on the negative strand (unidirectional). Groups 2 and 4: region 1 includes region 2 (comprehensive). b: Schema of 2 loci or exons on opposite strands. Group 5: convergent; Group 6: divergent; Groups 7 and 8: comprehensive.
General information for overlapping loci is shown in Table 1. A total of 12,692 loci were present on the positive strand, with 12,442 on the negative strand. The total number of overlapping loci on the same and opposite strands was 949 and 743, respectively. Except for Group 2 of chromosome 5, the number of instances where 2 loci were located on the same strand was similar to the number of 2 loci on opposite strands on every chromosome. This group on chromosome 5 includes the protocadherin (PCDH) cluster located on the positive strand of 5q31. On average, overlapping loci accounted for 6.7% of the total loci on each chromosome (range, 2.0-33.7%). The ratio of the number of overlapping loci to chromosome length revealed chromosomal characteristics. On chromosomes 22, 17 and 19, ratios were high for 2 loci overlapping on both the same and opposite strands. Analysis of overlap type revealed Groups 2 and 6 as occurring relatively frequently.
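The eight-group scheme and the lengths A, B and C described above can be expressed as a small classification function. This is a minimal sketch under stated assumptions, not the authors' implementation: coordinates are assumed to be 1-based, inclusive and increasing from the short arm (p) to the long arm (q); the positive strand is assumed to be transcribed in the p-to-q direction, so that "convergent" means the p-side region is on the positive strand; and the assignment of Group 7 versus Group 8 by the strand of the containing region is a guess, since the text does not spell it out. The function name `classify_overlap` is hypothetical.

```python
def classify_overlap(s1, e1, strand1, s2, e2, strand2):
    """Return (group, A, B, C) for two overlapping regions, else None.

    A: non-overlapping flank on the p side, B: overlap length,
    C: non-overlapping flank on the q side (all in bases).
    """
    # Region 1 is the one starting nearer p; on ties the longer one is first.
    first, second = sorted(
        [(s1, e1, strand1), (s2, e2, strand2)], key=lambda r: (r[0], -r[1])
    )
    (s1, e1, strand1), (s2, e2, strand2) = first, second
    if s2 > e1:
        return None  # no shared bases

    a = s2 - s1                    # p-side flank
    b = min(e1, e2) - s2 + 1       # overlapping region
    c = abs(e1 - e2)               # q-side flank
    contained = e2 <= e1           # region 1 includes region 2

    if strand1 == strand2:
        if strand1 == "+":
            group = 2 if contained else 1
        else:
            group = 4 if contained else 3
    elif contained:
        # Assumption: Group 7 when the containing region is on "+", else 8.
        group = 7 if strand1 == "+" else 8
    else:
        # Assumption: "+" upstream and "-" downstream is convergent (Group 5).
        group = 5 if strand1 == "+" else 6
    return group, a, b, c


print(classify_overlap(100, 500, "+", 400, 900, "-"))  # (5, 300, 101, 400)
```

Ordering the two regions before classification means the caller can pass them in either order and still obtain the same group.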
Given that the organization of several genes has not yet been clarified, insufficient information is currently available for determining the incidence of overlapping loci in these specific genes. We therefore examined overlapping exons for the human genome as a whole.
The total number of exons on the positive and negative strands was 404,776 and 402,510, respectively (Table 2). The number of 2 exons located on the same strand (3,843,308) differed substantially from the number of 2 exons on opposite strands (26,501). Interestingly, the number of comprehensive-type overlaps between 2 exons on the same strand (Groups 2 and 4) exceeded the number of comprehensive-type overlaps between 2 exons on opposite strands (Groups 7 and 8). This indicates the presence of numerous comprehensive-type overlaps on the same strand (2 exons located on the same strand with a smaller exon within the larger exon). The percentage of overlapping exons (out of the total number of exons) on opposite strands within each chromosome ranged from 1.1% to 5.5%. The total number of overlapping exons on opposite strands was 26,501 (3.3%) out of 807,286 exons, which suggests that parts of the nucleotide sequences for 26,501 exons are used to produce 2 transcribed products, one from each strand.
The ratios for overlapping exons/chromosomal length on chromosomes 22, 19, 14 and 17 were high for 2 exons located on both the same and opposite strands. Ratios were low on chromosomes Y, 13 and 18 for both overlapping types, suggesting that overlapping is not equally distributed among chromosomes. The NIT1/DEDD and ARTS-1/CAST pairs clearly show this overlapping pattern.
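The exon totals and the 3.3% figure quoted above follow from simple arithmetic on the reported counts, which the short Python check below reproduces. The variable names are introduced here for illustration; the numbers are taken from the text.

```python
# Counts as reported in the text (Table 2).
exons_positive = 404_776
exons_negative = 402_510
overlapping_opposite = 26_501

total_exons = exons_positive + exons_negative
pct_opposite = 100 * overlapping_opposite / total_exons

print(total_exons, round(pct_opposite, 1))  # 807286 3.3
```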
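The ratio of overlap count to chromosomal length used in this comparison can be computed as in the sketch below. Expressing the ratio per megabase is an assumption made here for readability, since the text does not state the unit, and the function names are hypothetical. The chromosome labels, counts and lengths in the example are placeholders, not values from Table 1 or Table 2.

```python
def overlaps_per_megabase(overlap_counts, chrom_lengths_bp):
    """Map each chromosome to its overlap count per megabase of sequence."""
    return {
        chrom: overlap_counts[chrom] / (chrom_lengths_bp[chrom] / 1_000_000)
        for chrom in overlap_counts
    }


def rank_by_density(density):
    """Return chromosome names ordered from highest to lowest density."""
    return sorted(density, key=density.get, reverse=True)


# Placeholder values for illustration only (not data from the study).
counts = {"chrA": 50, "chrB": 20, "chrC": 30}
lengths = {"chrA": 50_000_000, "chrB": 100_000_000, "chrC": 60_000_000}

density = overlaps_per_megabase(counts, lengths)
print(rank_by_density(density))  # ['chrA', 'chrC', 'chrB']
```

Normalizing by length matters because a long chromosome can carry more overlaps in absolute terms while still being sparse per unit of sequence.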

DISCUSSION
Previous reports (5)(6)(7)(8)(9)(10)(11)(12) have counted the number of genes exhibiting the overlapping phenomenon, but no reports have enumerated the number of loci or exons that exhibit this overlap. Furthermore, previous reports (5)(6)(7)(8)(9)(10)(11)(12) have only described this overlap phenomenon for opposite strands. The present strategy offers a valuable method for estimating the number of overlapping genes while the total number of genes in the human genome remains uncertain: the total was estimated at 32,000 in 2001 (1, 2) and subsequently at 22,000 in 2004.
The total number of exons in the human genome has been estimated at approximately 320,000 (8.8/gene) (1, 2), whereas the present data indicate the existence of more than twice that number. This discrepancy is due to different methods of enumerating exons. We simply counted all of the exons in the human genome, without considering how many exons comprise a gene. This method can identify all exons (e.g., more than 2 exons identified in the same region) and avoids confusion due to splicing variants.
Overlapping genes may evolve as a result of extensions of open reading frames (ORFs) caused by switching to an upstream start codon, substitutions in start or stop codons, or deletions and frame shifts that eliminate initiation or stop codons (13). The necessity for maintaining 2 functional overlapping genes inevitably constrains the extent to which both genes can become optimally adapted. However, such constraints can be alleviated by duplication of the overlapping gene pair, allowing for independent evolution of each gene in the resulting copies. Overlapping genes can therefore survive long evolutionary periods only when the overlap confers a selective advantage upon the organism. In viruses, overlapping genes are thought to persist due to the considerable constraints on genome size (7). In nonviral organisms, the potential advantages of overlapping genes are less clear, although co-regulation may be involved (4). Results of a comparative study of overlapping genes in the genomes of two closely related bacteria revealed that many overlapping genes arise due to incidental elongation of the coding region (16). Overlapping genes have generally been thought to be relatively rare in the human genome, but the results of the present study show that they are more abundant than was previously thought. Interestingly, overlapping genes do not appear to be the result of evolutionary pressure to minimize the size of the human genome.
Yelin et al. (17) demonstrated by in vitro experiments that antisense transcription occurs widely in the human genome. The resulting data set of 2,667 sense-antisense pairs was evaluated by microarrays containing strand-specific oligonucleotide probes derived from the region of overlap. Verification of specific cases by Northern blot analysis with strand-specific riboprobes confirmed the occurrence of transcription from both DNA strands. While these authors also predicted the existence of approximately 1,600 sense-antisense transcriptional units, transcribed from both DNA strands (13), no overlapping patterns were elucidated. Adachi et al. (18) reported that some genes overlap in a head-to-head manner (transcribed in opposite directions), and the occurrence of bidirectional gene pairs in some species was also recently reported. However, these reports did not describe the patterns of the overlapping exons. In our study, this type of overlap was included in the overlapping loci identified. It has also been reported that divergence (bidirectionality) is frequently observed, particularly in genes involved in DNA repair or replication (18). The functional significance of this is unclear, but divergence may permit two genes to share one CpG island for purposes of coordinated expression. In some bidirectional loci, expression of two divergent genes has been found to be coregulated, and promoters exhibiting bidirectional activity have often been observed (20,21). To the best of our knowledge, the phenomenon of overlapping exons is not specific to genes involved in DNA repair or replication, and further studies are needed to clarify the functional significance of overlapping genes.
Clarification of overlapping genes will facilitate the description of roles for each strand of the human genome and will provide insight into the mechanisms of evolution.
These results show that all overlapping types are distributed throughout the human genome, but that distributions differ for each chromosome.